ABSTRACT

Nonreporting  of  income  in  the  Current  Population  Survey  is  an 
important  problem  affecting  the  many  researchers  using  the  data  base. 
This  paper  discusses  an  approach  to  handling  this  problem  proposed  by 
Lillard,  Smith  and  Welch,  which  applies  selection  models  to  a  Box-Cox 
transformation  of  the  income  variable.  Topics  considered  here 
include:  the  inadequacy  of  single  imputation  and  the  desirability  of 
multiple  imputation,  the  importance  of  the  distinction  between  ignorable 
and nonignorable nonresponse, the sensitivity of inference to assumptions
unassailable by the data at hand, and the possibility of using the
CPS-SSA-IRS  Exact  Match  File  to  study  such  assumptions.  The  Lillard, 
Smith  and  Welch  paper  accompanied  by  this  discussion  is  to  appear  in  a 
book  presenting  the  proceedings  of  the  NBER  Labor  Cost  Conference  to  be 
published  by  the  University  of  Chicago  Press. 

IMPUTING INCOME IN THE CPS: COMMENTS ON "WHAT DO WE KNOW ABOUT WAGES:
THE  IMPORTANCE  OF  NON-REPORTING  AND  CENSUS  IMPUTATION"  BY 
LILLARD,  SMITH  AND  WELCH 


Donald  B.  Rubin 

1. Introduction.

"What Do We Know about Wages: The Importance of Non-reporting and
Census Imputation" by Lillard, Smith and Welch (LSW) is a very interesting
study of nonresponse on income items in the Census Bureau's Current
Population Survey (CPS). LSW points out that the CPS is a major
source of income data for economic research even though the nonresponse
rate on income items is about 15% - 20%. This level of nonreporting of
income,  especially  if  concentrated  among  special  types  of  individuals, 
should be of substantial concern to researchers in economics. As emphasized
in LSW, however, most published economic research ignores this
problem  when  using  CPS  data.  The  major  reason  that  researchers  can 
ignore  this  problem  is  that  before  CPS  public-use  tapes  are  released, 
the  Census  Bureau  imputes  (i.e.,  fills  in)  missing  income  data  (as  well 
as  other  data).  Although  imputed  data  are  flagged  to  distinguish  them 
from  real  data,  it  is  evidently  easy  for  researchers  to  be  seduced  into 
ignoring  this  distinction  and  treating  all  values,  imputed  and  real,  on 
the  same  basis. 

LSW  is  divided  into  three  main  sections.  In  the  first,  facts  are 
presented  concerning  the  CPS,  income  nonrespondents,  and  the  procedure 
used  by  the  Census  Bureau  to  impute  (i.e.,  the  "hot  deck").  In  the 
second  section,  a  statistical  model  is  formulated  to  explain  income 
nonresponse, specifically, a selection model using Box-Cox transformations
to normality. The third section summarizes empirical results
obtained  when  the  selection  model  is  applied  to  CPS  data. 

My  discussion  of  LSW  will  roughly  follow  the  outline  in  LSW  with 
digressions and extensions. My sections do not, however, follow in one-to-one
correspondence with theirs. After characterizing income nonreporters
in Section 2 and describing the Census Bureau's hot deck
procedure  in  Section  3,  in  Section  4  I  point  out  the  need  for  multiple 
imputation  if  uncertainty  due  to  nonresponse  is  to  be  properly  reflected 
in  an  imputed  data  set.  Section  5  provides  definitions  of  ignorable  and 
nonignorable  nonresponse,  while  Section  6  describes  the  LSW  selection 
model  and  emphasizes  that  external  information  is  needed  to  justify  the 
acceptance of the LSW model or any other particular model for nonresponse
as an accurate reflection of reality. Finally, Section 7
briefly  describes  the  CPS-SSA-IRS  Exact  Match  File,  which  might  be  used 
to  help  provide  such  external  information. 

2. Who are the Nonrespondents on Income Questions?

Of central importance for determining whether the 15% - 20% nonresponse
rate on income questions is of major concern is the extent to
which  income  nonreporters  are  different  from  income  reporters.  If  the 
nonreporters  were  just  a  simple  random  sample  from  the  population  of 
reporters  and  nonreporters,  the  loss  in  efficiency  of  estimation  created 
by  ignoring  the  nonreporters  altogether  would  be  of  little  concern. 

There is a great deal of evidence, however, showing that nonreporters
do differ from reporters in important ways. One such piece of
evidence that LSW presents is especially interesting. Apparently, if we
were to plot "probability of nonresponse on income items" vs. "amount of
actual income", the relationship would be u-shaped: moderate nonresponse
at low incomes, low nonresponse at moderate incomes and very high
nonresponse at high incomes. Moreover, LSW's evidence suggests that
this u-shaped relationship is created by the existence of two primary
types of income nonreporters. The first type is called "general nonreporters"
because they have a high nonresponse rate on many CPS questions,
not just income questions. These people tend to have low incomes
and approach CPS questions in a generally reluctant manner. The second
type of income nonreporter is called "specific nonreporters" because on
most CPS questions, that is, non-income questions, they have low nonresponse
rates, whereas on income questions their nonresponse rates are
very high (e.g., over 30%). The specific nonreporters tend to be professionals
with high incomes, for example, doctors, lawyers, and dentists.

If  we  accept  this  interesting  picture  as  relatively  accurate,  it 
seems  to  me  natural  and  desirable  to  try  to  build  a  nonresponse  model 
that  explicitly  recognizes  the  u-shaped  relationship  and  the  two  types 
of  income  nonreporters.  LSW,  however,  does  not  exploit  this  structure 
in  its  models,  and  instead  uses  a  model  for  nonresponse  asserting  that 
conditional  on  some  predictor  variables  (such  as  years  of  education), 
the  relationship  between  probability  of  nonresponse  on  income  items  and 
income  is  monotonic.  Of  course  one  can  criticize  virtually  any  analysis 
for  not  fully  exploiting  some  interesting  features  found  in  subsequent 
analyses.  Consequently,  my  comment  on  this  point  should  be  viewed  more 
as  offering  a  suggestion  for  further  study  than  as  criticizing  the  work 
presented  in  LSW. 


3. The Census Bureau's Hot Deck Imputation Scheme.

LSW  provides  an  exceptionally  clear  and  lucid  discussion  of  the 
Census  Bureau's  procedure  for  imputation,  the  hot  deck,  which  has  been 
used  since  the  early  1960's.  The  hot  deck  is  a  matching  algorithm  in 
the  sense  that  for  each  nonrespondent,  a  respondent  is  found  who  matches 
the nonrespondent on variables that are measured for both. The variables
used for the matching are all categorical, with varying numbers of
levels (e.g., "gender" has two levels, "region of country" has four
levels). If a match is not found, categories are collapsed and variables
are deleted so that coarser matches are allowed. Eventually,
every  nonrespondent  finds  a  match;  the  matching  respondent  is  often 
called  (by  hot  deck  aficionados)  "the  donor"  because  the  donor's  record 
of  values  is  donated  to  the  nonrespondent  to  fill  in  all  missing  values 
in  the  nonrespondent's  record. 
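
The matching-with-fallback idea just described can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the Census Bureau's actual procedure: the function name `hot_deck_impute`, the toy records, and the variable names are hypothetical; a single target variable is imputed rather than the whole donor record; coarsening is done only by dropping matching variables (not by collapsing categories); and the donor is chosen at random among the matches.

```python
import random

def hot_deck_impute(records, match_vars, target, rng):
    """Fill missing `target` values by copying from a matching respondent.

    records    -- list of dicts; a missing value is stored as None
    match_vars -- categorical variables, ordered so the LAST is dropped first
    rng        -- random.Random used to pick among equally good donors
    """
    donors = [r for r in records if r[target] is not None]
    if not donors:
        raise ValueError("no respondents available to act as donors")
    completed = []
    for rec in records:
        new = dict(rec, imputed=False, match_level=None)
        if rec[target] is None:
            # Try the most detailed match first; drop one variable at a
            # time until some donor agrees on all remaining variables.
            for n_vars in range(len(match_vars), -1, -1):
                keys = match_vars[:n_vars]
                pool = [d for d in donors if all(d[k] == rec[k] for k in keys)]
                if pool:
                    new[target] = rng.choice(pool)[target]
                    new["imputed"] = True        # flag imputed values
                    new["match_level"] = n_vars  # number of variables matched
                    break
        completed.append(new)
    return completed

# Hypothetical toy data: two nonrespondents on income.
records = [
    {"gender": "M", "educ": "college", "region": "NE", "income": 30000},
    {"gender": "M", "educ": "college", "region": "S", "income": 25000},
    {"gender": "F", "educ": "hs", "region": "NE", "income": 12000},
    {"gender": "M", "educ": "college", "region": "NE", "income": None},
    {"gender": "F", "educ": "hs", "region": "W", "income": None},
]
completed = hot_deck_impute(records, ["gender", "educ", "region"],
                            "income", random.Random(0))
```

In this toy run the first nonrespondent finds a donor at the full level of detail, while the second finds one only after "region" is dropped; recording `match_level` makes that loss of detail visible to the user of the file.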

LSW  points  out  that  the  number  of  variables  used  for  matching  and 
their  level  of  detail  has  expanded  over  the  years,  and  that  imputed 
income  can  be  sensitive  to  such  rule  changes.  For  example,  between  1975 
and 1976, years of education was added to the list of matching variables,
and as a consequence, the imputed incomes of nonrespondents with
many  years  of  education  increased  substantially  from  1975  to  1976.  Such 
changes  can  create  problems  when  comparing  income  data  in  different 
periods  of  time.  A  related  problem  is  that  even  though  the  ideal  match 
that  is  possible  under  the  hot  deck  is  closer  now  than  it  was  years  ago, 
many  nonrespondents  fail  to  find  donors  at  this  ideal  level  of  detail. 
For  one  example,  only  20%  find  donors  in  the  same  region  of  the 
country.  For  a  second  example,  judges  with  ideal  matches  are  imputed  to 
earn  approximately  $30,000  more  than  judges  without  ideal  matches. 

The hot deck, by trying for exact multivariate categorical matches,
is trying to control all higher order interactions among the matching
variables. This task is very difficult with many matching variables
when using a categorical matching rule, even if there is a large pool of
potential matches for the nonrespondents. Related work on matching
methods  in  observational  studies  investigates  categorical  matching 
methods  and  offers  alternative  matching  methods  (e.g.,  Cochran  and 
Rubin,  1973;  Rubin,  1976a,  1976b,  1980a;  Rosenbaum  and  Rubin,  1981).  I 
suspect  that  some  of  the  more  recent  work  (e.g.,  Rosenbaum  and  Rubin, 
1981) may have useful suggestions for an improved hot-deck-like
procedure. LSW does not suggest modifying the matching algorithm but
rather suggests using explicit statistical models.

4. The LSW Alternative to the Hot Deck and the Need for Multiple
Imputation.

LSW  suggests  an  alternative  to  hot  deck  imputation:  (a)  build  an 
explicit model (described here in Section 6), (b) estimate the parameters
of this model, and (c) impute by randomly drawing observations
from this model with unknown parameters replaced by estimates. Before
proceeding to describe the particular model LSW uses, I have several
general  comments  to  make  in  this  section  and  the  next. 

First,  for  the  data  producer,  some  form  of  imputation  is  almost 
required  and  often  desirable  even  if  not  required.  I  believe  the  Bureau 
feels  it  cannot  produce  public-use  files  with  blanks.  Also,  I  believe 
it  feels,  and  rightly  so,  that  it  knows  more  about  the  missing  data  than 
the  typical  user  of  public-use  files.  Furthermore,  the  typical  user  of 
public-use  files  will  not  have  the  statistical  sophistication  needed  to 
routinely  apply  model-based  methods  for  handling  nonresponse,  such  as 
those  reviewed  by  Little  (1982).  Of  course,  in  any  public-use  file,  all 
imputed  values  must  be  flagged  to  distinguish  them  from  real  values. 

Second,  imputation  based  on  explicit  modelling  efforts  may  require 
much  more  work  than  implicit  models  such  as  the  hot  deck  (or  some  other 
matching  method  for  imputation)  that  can  impute  all  missing  variables  at 
once  no  matter  what  the  pattern  of  missing  variables.  Of  course,  this 
does  not  mean  that  explicit  models  should  be  avoided:  explicit  model- 
based  methods  are,  in  principle,  the  proper  ones  to  handle 
nonresponse. 

Third, when drawing values to impute, in order to obtain inferences
with  the  correct  variability,  parameters  of  models  must  not  be  fixed  at 
estimated  values  but  must  be  drawn  in  such  a  way  as  to  reflect 
uncertainty  in  their  estimation. 

Fourth,  one  imputation  for  each  missing  value,  even  if  drawn 
according  to  the  absolutely  correct  model,  will  lead  to  inferences  that 
underestimate variability (e.g., underestimate standard errors).

Fifth,  there  exists  a  need  to  display  sensitivity  of  answers  to 
plausible  models  for  the  process  that  creates  nonresponse  since  the 
observed  data  alone  cannot  determine  which  of  a  variety  of  models  is 
correct. 

These  points  are  all  leading  to  the  suggestion  to  use  multiple 
imputation  as  proposed  in  Rubin  (1978a)  and  expanded  upon  in  Rubin 
(1980b).  Whether  using  an  implicit  model,  such  as  the  hot  deck,  or  an 
explicit  model  such  as  employed  in  LSW,  if  imputation  is  used  to  handle 
nonresponse,  multiple  imputation  is  generally  needed  to  reach  the 
correct  inference. 

Multiple imputation replaces each missing value by a pointer to a
vector, say of length m, of possible values; the m values reflect
uncertainty for the correct value. Imputing only one value can only be
correct when there is no uncertainty, but if there were no uncertainty,
the missing value would not be missing; consequently, multiple imputation
rather than single imputation is needed when there are missing
data.
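
As a rough illustration of how m imputations can be drawn so that parameter uncertainty is reflected (the third point above), the sketch below uses a deliberately simple setting that is an assumption of this example and not the LSW model: a single normally distributed variable, ignorable nonresponse, no predictors, and a noninformative prior on the mean and variance. The function name `multiply_impute_normal` and the data values are hypothetical.

```python
import random
import statistics

def multiply_impute_normal(observed, n_missing, m, rng):
    """Return m vectors of imputations for n_missing missing values.

    For each of the m imputations the parameters are first DRAWN from
    their posterior distribution rather than fixed at their estimates,
    and the missing values are then drawn given those parameters.
    """
    n = len(observed)
    ybar = statistics.fmean(observed)
    s2 = statistics.variance(observed)  # sample variance, divisor n - 1
    imputations = []
    for _ in range(m):
        # sigma^2 | data ~ (n - 1) s^2 / chi-square(n - 1); a chi-square
        # with k degrees of freedom is a gamma with shape k/2 and scale 2.
        chi2 = rng.gammavariate((n - 1) / 2.0, 2.0)
        sigma2 = (n - 1) * s2 / chi2
        # mu | sigma^2, data ~ Normal(ybar, sigma^2 / n)
        mu = rng.gauss(ybar, (sigma2 / n) ** 0.5)
        # missing values | mu, sigma^2 ~ Normal(mu, sigma^2)
        imputations.append([rng.gauss(mu, sigma2 ** 0.5)
                            for _ in range(n_missing)])
    return imputations

# Hypothetical observed values (e.g., incomes in thousands of dollars).
observed = [14.2, 15.1, 16.8, 15.9, 17.3, 14.7, 16.1, 15.5]
imputations = multiply_impute_normal(observed, n_missing=3, m=5,
                                     rng=random.Random(1))
```

Each of the five vectors completes the data set once; fixing `mu` and `sigma2` at `ybar` and `s2` instead would make the imputed data sets look more alike than the uncertainty about the parameters warrants.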

The m possible values for each of the missing data result in m
complete data sets, and these can be analyzed by standard complete-data
methods to arrive at valid inferences. Suppose for example that the
m imputations were all made under one model for nonresponse, such as
the LSW selection model, and suppose that with complete data we would
form the estimate $\hat{Q}$ with associated standard error $S$. Let $\hat{Q}_i$ and
$S_i$, $i = 1, \ldots, m$, be their values in each of the data sets created by
multiple imputation. Then the resultant multiple imputation estimate is
simply $\bar{Q} = \sum_i \hat{Q}_i / m$ with standard error
$\sqrt{\sum_i (\hat{Q}_i - \bar{Q})^2 / (m - 1) + \sum_i S_i^2 / m}$.
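
The combining rule in the preceding paragraph is easy to code. The sketch below follows the formula as stated there; the function name `combine` and the numerical inputs are hypothetical. Later presentations of multiple imputation inflate the between-imputation term by a factor (1 + 1/m), which this sketch deliberately omits in order to match the text.

```python
import math

def combine(estimates, std_errors):
    """Combine m complete-data estimates and their standard errors."""
    m = len(estimates)
    q_bar = sum(estimates) / m
    # Variability of the estimates across the m completed data sets.
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    # Average complete-data sampling variance.
    within = sum(s ** 2 for s in std_errors) / m
    return q_bar, math.sqrt(between + within)

# Hypothetical results from m = 3 completed data sets.
q_bar, se = combine([10.0, 12.0, 14.0], [1.0, 1.0, 1.0])
```

Here the average complete-data variance is 1, while the spread of the three estimates contributes 4, so a single imputation analyzed by itself would report a standard error of 1 when the combined standard error is the square root of 5.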

If  the  m  imputations  are  from  k  different  models,  then  those 
imputations  under  each  model  should  be  combined  to  form  one  inference 
under  each  model,  and  then  the  comparison  across  the  k  resulting 
inferences  displays  sensitivity  of  inference  to  the  k  different 
models. 

5. The Distinction between Ignorable Nonresponse and Nonignorable
Nonresponse.

Before  introducing  the  LSW  model  and  presenting  its  implications,  I 
think  that  it  is  important  to  expand  on  the  general  issue  of  the  kinds 
of  models  that  can  be  built  for  survey  nonresponse.  Such  models  can  be 
classified  into  ones  with  "ignorable"  nonresponse  and  those  with 
"nonignorable"  nonresponse,  the  terminology  being  due  to  Rubin  (1976c, 
1978b).  I  believe  that  LSW's  use  of  "random  nonresponse"  is  intended  to 
convey  essentially  the  same  notion,  although  I  find  the  LSW  use  of  this 
phrase  somewhat  inconsistent. 

Under ignorable nonresponse models, respondents and nonrespondents
that  are  exactly  matched  with  respect  to  observed  variables  have  the 
same  distribution  of  missing  variables.  The  Census  Bureau  hot  deck 
operates  under  this  assumption  although  it  does  not  have  to  do  so.  For 
example,  having  found  a  donor  for  a  nonrespondent,  instead  of  imputing 
the  donor's  income,  the  hot  deck  algorithm  could  be  instructed  to  impute 
the  donor's  income  plus  ten  percent.  If  we  accept  the  Census  Bureau's 
hot deck as currently implemented, then we implicitly accept the hypothesis
that nonresponse is ignorable, and then there is no need to be
concerned  with  selection  models,  such  as  that  used  in  LSW.  Instead, 
under  ignorable  nonresponse,  all  energy  should  be  focused  on  modelling 
the conditional distribution of missing variables given observed variables
for respondents, since, by assumption, this conditional distribution
is the same for nonrespondents and respondents. If missing values
are to be replaced by imputed values, however, whether these values
arise from implicit or explicit models, a single imputation generally
will underestimate variability. Consequently, the LSW statement
accepting the hot deck if operating at its most detailed level is not
entirely  appropriate  if  valid  inferences  are  desired,  even  if 
nonresponse  is  ignorable. 

Under nonignorable nonresponse models, respondents and nonrespondents
perfectly matched on observed variables have different distributions
on unobserved variables. The example of the modified hot deck
which imputes donor's income plus ten percent is an implicit nonignorable
nonresponse model; the LSW selection model is an explicit nonignorable
model. When nonignorable nonresponse is possible, as with income
nonreporting  in  the  CPS,  it  is  crucial  to  expose  sensitivity  of  answers 
to  different  models,  all  of  which  are  consistent  with  the  data.  An 
important  contribution  of  the  present  paper  is  that  it  defines  and 
illustrates  the  use  of  an  expanded  collection  of  such  models. 

Within  the  context  of  imputation  for  missing  values,  sensitivity  to 
models  can  only  be  exposed  through  the  use  of  multiple  imputation,  where 
for  each  missing  value  there  are  imputations  under  each  model  being 
considered  (e.g.,  two  imputations  under  the  ignorable  hot  deck,  two 
imputations under the nonignorable (plus ten percent) model, and two
imputations  under  the  LSW  nonignorable  selection  model).  Again,  such 
multiple  imputations  are  necessary  in  order  to  reach  valid  inferences 
under  each  model  and  to  expose  sensitivity  of  answers  to  population 
features  not  addressable  by  the  observed  data. 

6. The LSW Nonignorable Model and Analysis.


Let Y be earnings, which is sometimes missing in the CPS, and
let X be a vector of predictor variables (e.g., education, work
experience), which evidently is assumed to be always observed in the
CPS. Define Y* to be the Box-Cox (1964) transformed earnings
(Y* = (Y^θ - 1)/θ), Z to be an unobserved, hypothetical variable such
that Y is missing if Z > 0, and suppose (Y*, Z) given X is
bivariate normal with correlation ρ.

If ρ = 0, nonresponse is ignorable, whereas if ρ ≠ 0
nonresponse is nonignorable; as |ρ| → 1, the extent of nonignorable
nonresponse becomes more serious in the sense that the observed distribution
of Y* for respondents becomes less normal and more skewed.
This defines the LSW model, and LSW obtains maximum likelihood estimates
for all parameters, explicitly recognizing the truncation of Y at
$50,000 in the CPS. A quite similar model with θ = 0 (Y* = log(Y)) is
applied to CPS income data in Greenlees, Reece, and Zieschang (1982).

The extension to other θ is certainly interesting and potentially
quite  useful.  Of  particular  importance,  it  gives  users  a  broader  range 
of  models  for  nonresponse  to  which  sensitivity  of  estimation  can  be 
investigated. 
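
To make the structure of the selection model concrete, the following sketch simulates (Y*, Z) as bivariate normal with correlation ρ, marks Y as missing when Z > 0, and compares respondents with nonrespondents. It is a simulation under stated assumptions and not LSW's estimation procedure: there are no predictors X, the mean and standard deviation of Y* (`mu`, `sigma`) are arbitrary illustrative values, and the function names are hypothetical.

```python
import math
import random

def box_cox(y, theta):
    """Box-Cox transform; theta = 0 is taken to be the log."""
    return math.log(y) if theta == 0 else (y ** theta - 1.0) / theta

def inv_box_cox(y_star, theta):
    """Inverse transform, mapping Y* back to the earnings scale."""
    if theta == 0:
        return math.exp(y_star)
    return (theta * y_star + 1.0) ** (1.0 / theta)

def simulate(rho, n, rng, mu=170.0, sigma=30.0):
    """Draw n units; return Y* for respondents and for nonrespondents."""
    respondents, nonrespondents = [], []
    for _ in range(n):
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        y_star = mu + sigma * e1
        # Z has unit variance and correlation rho with Y*.
        z = rho * e1 + math.sqrt(1.0 - rho ** 2) * e2
        (nonrespondents if z > 0 else respondents).append(y_star)
    return respondents, nonrespondents

def mean(xs):
    return sum(xs) / len(xs)

def skewness(xs):
    """Sample skewness: third central moment over variance^(3/2)."""
    m = mean(xs)
    m2 = sum((x - m) ** 2 for x in xs) / len(xs)
    m3 = sum((x - m) ** 3 for x in xs) / len(xs)
    return m3 / m2 ** 1.5

rng = random.Random(0)
resp0, non0 = simulate(0.0, 20000, rng)   # ignorable: rho = 0
resp9, non9 = simulate(0.9, 20000, rng)   # strongly nonignorable

print("rho=0.0: skewness of respondents", round(skewness(resp0), 2),
      " mean gap", round(mean(non0) - mean(resp0), 1))
print("rho=0.9: skewness of respondents", round(skewness(resp9), 2),
      " mean gap", round(mean(non9) - mean(resp9), 1))
```

With ρ = 0 the respondents' Y* values look normal and the two groups have nearly equal means; with ρ = 0.9 the respondents' distribution is visibly asymmetric and the nonrespondents' mean differs substantially. In real data only the respondents' values are seen, so the asymmetry is the only trace of ρ, which is why the normality assumption carries so much weight.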

It must not be forgotten, however, that the estimation of parameters
is relying critically on the assumed normality of the regression
of (Y*, Z) on X: both θ and ρ are chosen by maximum likelihood to
make the residuals in this regression look as normal as possible. If in
the real world there is no (θ, ρ) that makes this regression like a
normal linear regression, then there is no real reason to believe that
the answers that are obtained by maximizing over θ and ρ lead to
better real world answers. A small artificial example I've used before
(Rubin, 1978) illustrates this point in a simpler context:

Suppose  that  we  have  a  population  of  1000  units,  try  to  record 
a  variable  Z,  but  half  of  the  units  are  nonrespondents.  For 
the  500  respondents,  the  data  look  half-normal.  Our  objective 
is  to  know  the  mean  of  Z  for  all  1000  units.  Now,  if  we 
believe  that  the  nonrespondents  are  just  like  the  respondents 
except  for  a  completely  random  mechanism  that  deleted  values 
(i.e.,  if  we  believe  that  mechanisms  are  ignorable),  the  mean 
of  the  respondents,  that  is,  the  mean  of  the  half-normal 
distribution,  is  a  plausible  estimate  of  the  mean  for  the  1000 
units  of  the  population.  However,  if  we  believe  that  the 
distribution  of  Z  for  the  1000  units  in  the  population  should 
look  more  or  less  normal,  then  a  more  reasonable  estimate  of 
the  mean  for  the  1000  units  would  be  the  minimum  observed  value 
because units with Z values less than the mean refused to respond.
Clearly, the data we have observed cannot distinguish
between these two models except when coupled with prior
assumptions. (p. 22)
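
The quoted example can be reproduced numerically. The sketch below generates the second of the two scenarios (a normal population in which units below the mean do not respond) and computes both candidate estimates from the respondents alone; the standard normal population and the variable names are assumptions of this illustration.

```python
import random

rng = random.Random(0)

# A population of 1000 units whose Z values really are normal (mean 0).
population = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Nonignorable mechanism: units with Z below the mean refuse to respond,
# so the roughly 500 respondents look half-normal.
respondents = [z for z in population if z >= 0.0]

true_mean = sum(population) / len(population)

# Estimate if nonresponse is believed ignorable: the respondents' mean.
ignorable_estimate = sum(respondents) / len(respondents)

# Estimate if the population is believed normal: the minimum observed value.
normal_model_estimate = min(respondents)

print("true mean of all 1000 units:  ", round(true_mean, 3))
print("respondent mean (ignorable):  ", round(ignorable_estimate, 3))
print("minimum observed (normality): ", round(normal_model_estimate, 3))
```

In this simulated population the minimum observed value is close to the true mean and the respondents' mean is far from it. Had the population itself been half-normal with nonresponse completely at random, the same respondent data would have arisen and the ranking of the two estimates would be reversed, which is the point of the quotation.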

Notwithstanding  the  above  caveats,  suppose  we  put  our  faith  in  the 
normal  linear  model  for  the  bivariate  regression  of  (Y*,Z)  on  X.  LSW 
produce  some  interesting  empirical  results  using  white  males,  16-65  years 
old,  in  the  1970,  1975,  1976  and  1980  CPS.  One  interesting,  but  not 
surprising, result is that fixing θ at 1 (Y* = Y) produces very
different answers from fixing θ at 0 (Y* = log(Y)); if θ = 1,
nonrespondents are imputed to earn less than matching respondents,
whereas if θ = 0, nonrespondents are imputed to earn more than
matching respondents. With θ fixed, the asymmetry in the Y* given
X residuals addresses the correlation ρ and so determines the extent
to which the nonresponse is nonignorable. Thus, we have learned that
the Y given X residuals are skewed left and the log(Y) given X
residuals are skewed right. Further study shows that θ = .45 provides
a better fit to the data than either θ = 0 or θ = 1, but that the
residuals are still skewed right: under θ = .45 we find that nonrespondents
are imputed to earn more than similar respondents: θ = .45
leads to a 10% increase in average earnings over the CPS hot deck
values, $18,000 vs. $16,000.

But we must remember that if the distribution of Y^(.45) given
X really has the right asymmetry that is observed when Y^(.45) is
regressed on X, then the adjustment created by assuming a selection
effect on Z is entirely inappropriate, and (just as with the artificial
half normal example) the data cannot distinguish between the
ignorable and nonignorable alternatives. More precisely, suppose first
that in the population, Y^(.45) has a linear regression on X with a
skew distribution of residuals like that observed when we regress
Y^(.45) on X for the CPS data and that nonresponse is ignorable; such
a model would generate data just like those we have observed, and then
we should not be imputing higher incomes for nonrespondents than
respondents with the same X values.

In contrast, suppose that Y^(.45) in the population really has a
normal linear regression on X and that the stochastic censoring
implied by the LSW probit-nonresponse model is correct, i.e.,
nonresponse  is  nonignorable  with  this  particular  form;  then  as  LSW 
shows,  we  should  be  imputing  higher  incomes  for  nonrespondents  than 
respondents  with  the  same  X  values.  There  is  no  way  that  the  observed 
data  can  distinguish  between  these  two  alternatives;  if  the  authors 
really believe Y* given X in the population is normal for some
θ, then they can correctly assert that the CPS hot deck procedure is
biased. If they admit the possibility that Y* given X is not normal
or even symmetric for any θ, then they cannot legitimately assert that
their  answers  are  better  than  the  CPS  answers. 

In the same vein, LSW's checking the accuracy of the LSW model by
checking the prediction of respondents' values really does not adequately
check the imputations of the model for nonrespondents. In
particular,  both  the  ignorable  and  nonignorable  nonresponse  models 
discussed  above  will  accurately  reproduce  the  observed  data  for 
respondents,  but  will  give  very  different  results  for  nonrespondents. 

In  order  to  address  which  model  is  more  appropriate,  we  really  need  data 
from  nonrespondents  or  some  external  information  about  the  distribution 
of  reported  incomes  in  the  entire  population. 

7. The CPS-SSA-IRS Exact Match File.

There is a data set that provides data relevant to assessing the
differences  in  distributions  of  incomes  between  CPS  nonrespondents  and 
respondents.  This  data  set  is  the  CPS-SSA-IRS  (SSA  =  Social  Security 
Administration; IRS = Internal Revenue Service) Exact Match File (Aziz,
Kilss, and Scheuren, 1978). The exact match file is based on a sample
of 1978 CPS interviews with incomes obtained from SSA and IRS administration
records. Thus, this file is a data set consisting of CPS
respondents  and  nonrespondents  with  administrative  income  data  always 
observed.  By  treating  CPS  nonrespondents'  administrative  income  data  as 
missing  and  applying  specific  methods  for  handling  nonresponse,  we  do  in 
fact  obtain  some  evidence  for  the  adequacy  of  these  specific  techniques 
for adjusting for nonresponse bias, although admittedly for administrative
income rather than CPS reported income. Two papers doing this will
be  mentioned. 

Herzog  and  Rubin  (1982)  compare  the  imputations  from  the  CPS  hot 
deck  and  an  explicit  two-stage  linear/log-linear  model;  they  also 
evaluate  the  utility  of  multiple  imputation  for  obtaining  proper 
inferences.  This  paper's  objective,  however,  is  to  predict  Social 
Security  Benefits  rather  than  total  income,  and  so  its  results  do  not 
address  the  same  kind  of  income  nonresponse  as  studied  in  LSW. 

A  highly  relevant  paper,  however,  is  Greenlees,  Reece  and  Zieschang 
(1982),  which  also  studies  earned  income.  Not  only  does  this  paper  use 
essentially the same selection model as LSW with the restriction θ = 0
(i.e.,  income  is  log  normal),  but  it  also  handles  the  truncation  of 
income  at  $50,000  using  maximum  likelihood  techniques.  Interesting 
conclusions  of  this  article  are  that  (a)  the  model  predicts 
nonrespondent incomes rather well, (b) the true residuals in the log
scale  for  the  entire  population,  although  not  normal,  are  approximately 
symmetric,  and  (c)  the  CPS  hot  deck  underestimates  income  by  about 
7%.  These  results  lend  modest,  although  mixed,  support  to  the  utility 
of  the  LSW  selection  model  for  CPS  income  data. 

The results of applying the LSW techniques to the Exact Match File
would  certainly  be  of  interest.  Of  particular  importance,  such  an 
application  could  help  to  combat  criticism  based  on  the  fact  that  the 
CPS  data  alone  cannot  be  used  to  select  which  model  for  nonresponse  is 
truly  appropriate. 

This suggestion, however, should certainly not be taken as indicative
of a fatal flaw in LSW. LSW is an excellent paper. It is clearly
written and demonstrates careful thought on important issues and fundamental
understanding of the CPS and the Census Bureau's hot deck. Moreover,
it describes extended statistical tools for handling the problem
of  nonresponse,  and  applies  these  tools  to  an  important  real  world 
problem.  LSW  fits  in  very  well  with  other  important  contributions  on 
nonresponse. 